
How to export data with changefeeds

Last edited on June 21, 2023


    Exporting data is a crucial tool in any database user’s toolkit. In CockroachDB, the EXPORT command has long provided this essential functionality to:

    • Move data to a different data store for business analytics

    • Migrate data to a new database

    • Archive data in a platform-neutral way

    • Seed an application with data

    Say, for example, you need to export a sizable amount of JSON to seed a core data store for a streaming service. Sometimes you can pre-seed this data alongside your application with an EXPORT. But if you’re working on the scale of a company like Netflix, EXPORT commands come with some limitations that led us to explore an additional way to export data.
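
    For example, a one-time EXPORT to cloud storage might look like the following sketch (the titles table and the storage bucket are hypothetical):

    -- Hypothetical table and bucket; EXPORT writes one or more CSV files to the storage URI.
    EXPORT INTO CSV 'gs://acme-exports/titles?AUTH=implicit' FROM TABLE titles;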

    Problems with EXPORT

    First of all, EXPORTs have some configurability and integration limitations — but these are minor issues compared to two other problems:

    • EXPORT commands have a natural scale limit: above a certain table size (terabytes of data), EXPORTs are likely to run into snags that cause them to restart. Since they must restart from the beginning, an EXPORT at very high data scale is unlikely to ever finish.

    • EXPORTs have limited observability compared to jobs like schema changes and backups, since they aren’t actually jobs themselves.


    There is a solution to these problems. In CockroachDB, you can export data using changefeeds (available with CockroachDB Enterprise and in our fully managed offerings, CockroachDB Dedicated and CockroachDB Serverless).

    (CockroachDB Core users are invited to try an experimental feature: EXPERIMENTAL CHANGEFEED FOR. This statement allows users to create a new changefeed to stream row-level changes to the client indefinitely, until the underlying connection is closed or the changefeed is canceled.)
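
    As a minimal sketch (the titles table is hypothetical), a Core changefeed streams changes straight back over the SQL connection:

    -- Hypothetical table; rows are streamed to the SQL client until the connection closes or the feed is canceled.
    EXPERIMENTAL CHANGEFEED FOR titles;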

    How to export data with changefeeds

    Enterprise change data capture (CDC) provides row-level change subscriptions for downstream processing. In CockroachDB, users can create a new changefeed to stream row-level changes in a configurable format to a configurable sink such as Apache Kafka.

    -- The sink URI can point to Kafka, a webhook, cloud storage, or Google Cloud Pub/Sub.
    CREATE CHANGEFEED FOR TABLE table_name INTO 'scheme://sink-uri' WITH format = csv, initial_scan = 'only';

    The CREATE CHANGEFEED statement creates a new CockroachDB Enterprise changefeed, which targets an allowlist of tables, called “watched rows”. Every change to a watched row is emitted as a record in a configurable format (JSON, Avro, or CSV) to a configurable sink (Kafka, Google Cloud Pub/Sub, a cloud storage sink, or a webhook sink). You can create, pause, resume, alter, or cancel an Enterprise changefeed.
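
    For instance, once a changefeed is running you can manage it through the jobs interface. A minimal sketch (the job ID shown is hypothetical):

    -- List changefeed jobs and their IDs.
    SHOW CHANGEFEED JOBS;
    -- Hypothetical job ID: pause, resume, or cancel the changefeed like any other job.
    PAUSE JOB 870774961812340737;
    RESUME JOB 870774961812340737;
    CANCEL JOB 870774961812340737;
    -- A paused changefeed's watched tables can also be modified with ALTER CHANGEFEED <job_id> ADD/DROP.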

    In the base case, changefeeds send a continuous stream of a table’s row-level updates over a SQL connection or to an external system. With the initial_scan = 'only' option, however, a changefeed emits a snapshot of the table and then the job completes.

    Using changefeeds allows us to take advantage of a wider set of integrations and configurability, as well as the robust-at-scale changefeed architecture.

    Automatic data scaling

    The huge advantage: in CockroachDB changefeeds are built to scale organically with your data. Here’s how it works:

    • Changefeeds take advantage of parallel processing of data

    • Changefeeds use the jobs system, which comes with enhanced observability and scale infrastructure (see the sketch after this list)

    • Changefeeds utilize checkpointing to remain robust as data scales out
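
    As a minimal sketch of that jobs-based observability, you can inspect a changefeed’s status and checkpoint progress directly in SQL:

    -- Status and checkpoint (high-water timestamp) for each changefeed job.
    SELECT job_id, status, running_status, high_water_timestamp
    FROM [SHOW CHANGEFEED JOBS];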

    Let’s compare CREATE CHANGEFEED to a traditional EXPORT:
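
    As a rough sketch (hypothetical table and bucket), the two approaches to a one-time export look like this:

    -- Traditional export: not a job, and it restarts from the beginning on failure.
    EXPORT INTO CSV 'gs://acme-exports/titles?AUTH=implicit' FROM TABLE titles;

    -- Export-style changefeed: runs as a parallelized, checkpointed, observable job.
    CREATE CHANGEFEED FOR TABLE titles
        INTO 'gs://acme-exports/titles?AUTH=implicit'
        WITH format = csv, initial_scan = 'only';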

    For more details about the differences in configurability, check out the EXPORT and Enterprise changefeed docs.

    Changefeeds offer a great alternative to EXPORTs for users operating at high data scale. There are still a few limitations compared to the EXPORT command, but changefeeds are quickly approaching feature parity. Moreover, changefeeds offer many configuration and destination options beyond those offered by EXPORT.
